QTM 447 Lecture 1: Introduction

Kevin McAlister

January 14, 2025

Welcome to QTM 447!

This class provides an overview of the statistical methods that underpin a number of more advanced machine learning techniques.

  • The goal of this class is to provide a statistical/mathematical introduction to topics in machine learning that extend on those methods learned in your introductory class

Class flow:

  • Part 1: Review of some important topics, regression methods and regularization, first-order optimization methods

  • Part 2: Neural Networks and Deep Learning, NNs for tabular data, CNNs for images, RNNs and Transformers for sequences

  • Part 3: Generative Machine Learning, basics of generation methods, PCA and Autoencoders, Autoregressive Models, Generative Adversarial Networks

Pre-requisites

For this class, I am expecting that students have taken a class on:

  • Linear regression (Like QTM 220)

  • Machine Learning (like QTM 347)

  • Calculus 1-3

  • Linear algebra

I don’t expect you to remember everything from these classes, but I do expect that when you see certain concepts they aren’t completely foreign.

Lectures

We’re going to meet here on Tuesdays and Thursdays from 2:30 PM - 3:45 PM

  • There is no formal attendance requirement for this course

  • That said, attendance is strongly encouraged (read basically required)

  • It’s less about attendance itself and more about committing 90 minutes twice a week to this class’s materials

Lectures

During lectures:

  • I expect that students who attend will be ready to participate (ask questions, answer check-in questions, etc.)

  • Be willing to stop me if there’s something that isn’t clear. If it’s not clear to you, then I’m sure there’s someone else in class who is also confused.

  • It’s a relatively small class, so I have no problem spending time re-clarifying points that may not have been well made.

Lectures

I will also be simulcasting the lectures on Zoom in the class Zoom room posted to Canvas.

  • However, I will not be recording these lectures

  • If, for some reason, you need me to record, let me know beforehand and I’ll decide if it is appropriate.

All lecture slides will be posted before class.

  • I’ll also try to post lecture notes for various topics.

Office Hours

I’ll have two office hours periods this semester:

  • Mondays 4:00 - 5:15ish

  • Wednesdays from 5:30 - 6:30

All office hours will occur in my office - PAIS 579

  • No plans for Zoom office hours - can make special arrangements as needed

Group Problem Sets

Over the course of the semester, there will be 6-8 problems sets. These will account for 50% of your final grade.

  • Implement and extend the materials discussed in class.

  • Introduce software and coding

  • Derivations and Proofs

Problem Sets will be where you get your applied practice

  • Lectures will largely center on the theory of why things work

  • Problem sets will relate more to the how

Group Problem Sets

Each problem set will be 2-3 questions

  • Time required will largely depend on level of comfort with coding!

  • Really good applied practice to beef up your coding skills.

Problem set solutions should be written up in some form of Markdown

  • Quarto or a Jupyter Notebook is probably the best for this

  • Weaves code, text, and LaTeX together seamlessly

Group Problem Sets

All problem sets should be submitted to the appropriate Canvas assignment as two files:

  • The raw notebook file (.rmd, .qmd, .ipynb)

  • A rendered version of the notebook (.html or .pdf)

Each problem set will be posted in a variety of different formats, so feel free to use this as a template for your final solutions.

Group Problem Sets

All problem sets in this class can be completed in groups of, at most, 3 students

  • Each student should turn in a copy of the solutions; it can be identical to the group’s solutions.

  • All collaborators must be outlined at the top in the by-line

  • I recommend the same group each time - repeated vs. one-shot games of trust.

Since it is an upper level course, I’m creating a mechanism for getting rid of shirkers

  • If they don’t contribute equally, get rid of ’em.

You can complete assignments individually.

I can also assign you a group, if you’d like. Just email me and I’ll put people together who want a group.

Group Problem Sets

Each student gets one freebie: one problem set can be turned in up to 5 days after the due date with no penalty.

After the freebie, late assignments will receive a 10% per day deduction

  • There can be exceptions, but these will be hard to receive

  • Plan ahead accordingly

Final Project

The other 50% of your final grade will be determined by a final project

  • A significant project that applies methods discussed in this class

  • Up to you (and your group) exactly what this is

Do something interesting!

  • Projects that are too simple will not be accepted

  • No basic comparisons of methods

  • Really try to answer a question that’s interesting to you

Final Project

Three checkpoints:

  • On March 25th, teams will present a single slide outlining their final project (5%)

  • On April 25th, teams will present project posters at QTM’s end of semester showcase (20%)

  • By May 7th, each student should submit a final paper about the final project. A scientific-styled paper no more than 15 pages. (25%)

Final Project

The goal of this final project is to give you something to include in your portfolio as you apply for jobs or grad school

  • Maybe start something that turns into a bigger project later

There aren’t a lot of assignments in this class because I really want you to put a lot of effort into this project!

Textbooks

There will be five textbooks we’ll use in this class:

  • Probabilistic Machine Learning: An Introduction (PML1)

  • Probabilistic Machine Learning: Advanced Topics (PML2)

  • Deep Learning (DL)

  • Understanding Deep Learning (UDL)

  • Elements of Statistical Learning (ESL)

Each book is freely available online and posted to Canvas.

  • I’ll post corresponding chapters for each topic in the weekly modules

  • I would recommend using lectures as a starting point and reading chapters to fill in gaps

  • Many topics in these books that we won’t cover!

Coding

Modern statistical machine learning is equal parts pen-and-paper and computational implementation.

  • Very few things in this class will be fully solvable via pen and paper!

A large portion of this course revolves around programming and computing.

  • The lectures are going to talk about this, but the majority of the applications will be on you to figure out in your problem sets

  • Reading documentation is a skill!

  • Understanding how to write functions is key. Being able to build an algorithm (even if it’s not elegant) is a useful skill

Coding

I’ll be using Python for this class.

You can technically use any language you want for this class

  • But, I highly recommend Python

  • Deep learning libraries are developed mostly with Python in mind

Coding

My setup:

  • Python 3.10

  • VSCode

  • Jupyter Notebooks/Quarto

My local hardware:

  • AMD Ryzen 5900x - 12 core/24 thread processor

  • NVIDIA RTX 3090TI - 10,752 CUDA Cores/24GB GDDR6 VRAM

Coding

Please use GitHub Copilot/ChatGPT-4o!

  • This is a machine learning course and these LLMs are a modern marvel!

So many of the annoyances of coding with Python are gone when you use code LLMs

  • Matplotlib syntax

  • Documentation searches

  • scikit-learn nuances

If you have a GitHub account, get it verified as a student and you’ll be able to use Copilot for free

  • I think the $20/month ChatGPT Plus subscription is money well spent

Coding

As we progress through this class, we’re going to get to methods that are computationally demanding

  • Often too demanding for your laptop

If you don’t have a machine with a discrete GPU, you’ll be using Google Colab

  • I highly recommend paying $10 a month for Colab Pro during this class

  • Gives access to better GPUs and priority time on them

  • If $10 a month presents a problem, let me know.

Coding

We’ll be using PyTorch throughout this class

  • Software for general purpose optimization

  • The dominant software for deep learning

We’ll also use some nice utility functions from PyTorch Lightning

  • A really nice library that sits on top of PyTorch and organizes code

  • Automatically detects and uses GPUs and other accelerators when available

The Goals of this Class

The topics I hope to cover this semester are outlined on the syllabus.

  • The pace is ambitious.

  • If we need to take more time on topics, we will adjust.

  • Some stretch goals on GANs and VAEs at the end of the semester. Will replace as needed.

The Goals of this Class

Any questions?

Machine Learning

What is a machine learning algorithm?

I really like this definition (Mitchell, 1997):

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

  • In other words, a machine learning algorithm learns from data - the more data and the more trials, the better

  • We would hope that the algorithm is able to perform well at a defined task given some data

Tasks

Machine learning is so popular because it can be used to address many different questions/achieve many different tasks.

Two basic supervised starting points:

Regression - given a matrix of input features, \mathbf X, and an outcome vector, \mathbf y, with each y_i \in \mathbb R, output a function \hat{f}(\mathbf x) that accurately predicts the value of the outcome given \mathbf x.

  • Use inputs to predict the value of a continuous outcome.

What are some questions we can answer with regression algorithms?
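As a minimal sketch of the regression task (synthetic data, ordinary least squares; the setup is purely illustrative):

```python
import numpy as np

# Synthetic data: N = 100 observations of P = 2 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
beta_true = np.array([2.0, -1.0])
y = X @ beta_true + rng.normal(scale=0.1, size=100)

# Learn f_hat(x) = x^T beta_hat via ordinary least squares
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predict the continuous outcome for a new feature vector
x_new = np.array([1.0, 1.0])
y_pred = x_new @ beta_hat
```

The learned \hat{f} is exactly a map from a feature vector to a predicted continuous outcome.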

Tasks

Machine learning is so popular because it can be used to address many different questions/achieve many different tasks.

Two basic supervised starting points:

Classification - given a matrix of input features, \mathbf X, and an outcome vector, \mathbf y, with each y_i \in \{1, 2, ..., K\}, output a function \hat{f}(\mathbf x) that accurately predicts the class given \mathbf x.

  • Use inputs to predict the value of a countable and finite set of outcomes.

What are some questions we can answer with classification algorithms?
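A toy illustration of the classification task, using a nearest-centroid rule on synthetic data (both the rule and the data are illustrative, not a recommended method):

```python
import numpy as np

# Synthetic data: K = 2 classes, well separated in feature space
rng = np.random.default_rng(1)
X0 = rng.normal(loc=-2.0, size=(50, 2))   # class 1
X1 = rng.normal(loc=+2.0, size=(50, 2))   # class 2
X = np.vstack([X0, X1])
y = np.array([1] * 50 + [2] * 50)

# Learn f_hat: predict the class whose centroid is closest
centroids = {k: X[y == k].mean(axis=0) for k in (1, 2)}

def predict(x):
    return min(centroids, key=lambda k: np.linalg.norm(x - centroids[k]))
```

Here \hat{f} maps a feature vector to one of a countable, finite set of labels.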

Tasks

These two basic tasks describe much of machine learning!

In your introductory machine learning class, the methods discussed largely centered on solving these kinds of problems.

  • We have a feature matrix (N observations of P features), \mathbf X

  • We have an outcome vector (N observations of the outcome), \mathbf y

  • Use \mathbf X and \mathbf y to learn \hat{f}(\mathbf x) that maps features to a predicted outcome.

Tasks

However, there are many more interesting tasks that can be solved using machine learning!

Classification with Missing Inputs:

Given a feature matrix, \mathbf X, predict the associated class for a new feature vector, \mathbf x_0.

However, it is not guaranteed that \mathbf X or \mathbf x_0 has all of the features

Tasks

Let’s suppose that we are attempting to learn a binary classifier using logistic regression:

Given \mathbf X and \mathbf y, define our hypothesis class as the set of functions covered by:

Pr(y = 1 | \mathbf x) = \frac{\exp[\mathbf x^T \hat{\boldsymbol \beta}]}{1 + \exp[\mathbf x^T \hat{\boldsymbol \beta}]}

The goal:

Find \hat{\boldsymbol \beta} that minimizes the cross-entropy generalization error (or log loss) given the training data.
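The hypothesis class and loss above can be sketched directly in NumPy; the synthetic data and plain gradient descent below are illustrative only (first-order methods are covered properly later in the course):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(beta, X, y):
    # Cross-entropy (log loss) for binary y in {0, 1}
    p = sigmoid(X @ beta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Synthetic training data drawn from a true logistic model
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
beta_true = np.array([1.5, -1.0])
y = (rng.uniform(size=200) < sigmoid(X @ beta_true)).astype(float)

# Crude gradient descent on the training log loss
beta_hat = np.zeros(2)
for _ in range(2000):
    grad = X.T @ (sigmoid(X @ beta_hat) - y) / len(y)
    beta_hat -= 0.5 * grad
```

The fitted \hat{\boldsymbol \beta} drives the training log loss well below that of the all-zeros coefficient vector.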

Tasks

The issue: if \mathbf x_0 has missing values, we can’t assess the function output!

  • NA doesn’t correspond to a number…

The tools discussed in your intro class can’t handle this!

Tasks

A proposal:

Assume that each \mathbf x is a draw from f(\mathbf X) - a P-dimensional proper probability density function.

Also, each y is a draw from the proper probability mass function f(y | \mathbf x)

Discriminative models learn the conditional density of y given \mathbf x

  • Logistic regression

Generative models learn the joint density of \mathbf x and y - typically via f(y , \mathbf x) = f(y | \mathbf x)f(\mathbf x).

  • Remember LDA, QDA, and Naive Bayes?

  • These models are too simple, though, for many cases!
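A minimal sketch of the generative approach, in the spirit of Naive Bayes: learn f(\mathbf x | y) per class (here, independent Gaussians fit to synthetic data) plus the prior f(y), then classify via the joint density:

```python
import numpy as np

# Synthetic two-class data
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
               rng.normal(+1.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Generative step: per-class Gaussian f(x | y) and class prior f(y)
params = {}
for k in (0, 1):
    Xk = X[y == k]
    params[k] = (Xk.mean(axis=0), Xk.std(axis=0), len(Xk) / len(X))

def log_joint(x, k):
    mu, sd, prior = params[k]
    # log f(x | y = k) + log f(y = k), features assumed independent
    ll = -0.5 * np.sum(((x - mu) / sd) ** 2 + np.log(2 * np.pi * sd ** 2))
    return ll + np.log(prior)

def predict(x):
    return max((0, 1), key=lambda k: log_joint(x, k))
```

Because we modeled f(y, \mathbf x) rather than just f(y | \mathbf x), the same fitted pieces could also be used to reason about the features themselves - that is the flexibility the slide is pointing at.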

Tasks

If we could adequately jointly learn f(\mathbf X) and f(y | \mathbf x), then we could fill in the missing information with the most likely values given what we do observe.

  • A better method than just assuming a simple multivariate normal distribution over the features

The toolkit you currently have is unequipped to do this in a meaningfully flexible way!

Note that it is really common to have missing inputs

  • Medical diagnosis - only a few tests are run for each person, but different tests across the population

  • Survey responses - people aren’t required to fill out every answer

  • Missing matchups in college basketball - if we want to predict who will win the NCAA tournament, every team doesn’t play in the regular season
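To make the “simple multivariate normal” baseline concrete: if f(\mathbf x) were bivariate normal with known parameters (assumed here purely for illustration; in practice they would be estimated), a missing coordinate could be filled in with its conditional mean given the observed one:

```python
import numpy as np

# Assume f(x) is bivariate normal with known mean and covariance
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])

# x_0 = (x1, x2) with x2 missing and x1 = 2.0 observed
x1_obs = 2.0

# Conditional mean of x2 given x1 for a bivariate normal:
# E[x2 | x1] = mu2 + Sigma21 / Sigma11 * (x1 - mu1)
x2_fill = mu[1] + Sigma[1, 0] / Sigma[0, 0] * (x1_obs - mu[0])
```

This works, but only because the multivariate normal assumption makes the conditional distribution trivial - richer models of f(\mathbf x) are what we’re after.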

Tasks

Structured Inputs/Outputs

Suppose our outcomes of interest are not just a number or a class.

Rather, it’s a sequence of some sort:

  • Pixels in an image

  • A sentence

How do we use regression tools to predict an outcome of this sort?

Tasks

One approach is to develop different functions for each independent output (each pixel in an image, for example)

  • Problem?

Each output heavily depends on the others!

Is it possible to learn about many different correlated outcomes at once?

Tasks

Synthesis

Given a training set, \mathbf X, generate new examples that are similar to those in the training data

  • Not only similar, but coherent!

Think DALL-E, the image generation software

Or ChatGPT, which provides coherent textual answers to questions that it may not have seen in the training set

Tasks


These methods are getting really good!

Let’s play a game:

Which Face is Real?

Tasks

Even music can be convincingly generated using generative methods these days!

Note: This is an at home listen. Also, apologies for the language. But, this is the best AI generated music in the game right now!

The Bottom 1

The Bottom 2

The Bottom 3

Tasks

A formal statement:

We believe that images/text/audio are generated from a true joint distribution over their components (e.g., pixels), \mathbf X

Each example we see, \mathbf x_i, is a sample from this distribution

Goal: Learn f(\mathbf X) in such a way that:

  • It sufficiently captures complex dependencies between inputs

  • It can be sampled from to generate new coherent objects

  • It can be modified to emphasize desired features

Tasks

There are many other important tasks that machine learning can achieve that don’t exactly fit into the regression or classification bucket:

  • Machine translation

  • Anomaly detection

  • Denoising

  • Standard density estimation

  • Many, many more

Tasks

These tasks tend to:

  • Have complicated mappings of complicated inputs to complicated outputs

  • Have high-dimensional feature sets

  • Involve learning something about the distribution of the features

In other words, many interesting tasks are more complex than tasks that can be addressed via basic machine learning methods!

Methods

In your previous classes, you’ve likely covered a number of sophisticated methods.

However, these methods are limited by their rigidity

  • The types of f(y | \mathbf x) that can be uncovered are limited in complexity

  • Or the type of y is limited to standard scalar outputs

  • Or the computational complexity of the method is extremely high

And very few are capable of jointly learning f(\mathbf x) and f(y | \mathbf x)

Methods

Linear Regression

\hat{y} = \mathbf x^T \boldsymbol \beta

can only uncover functions that are linear combinations of the input features

  • We can add nonlinear terms, but we’re still limited to additive combinations

  • The more features, the more terms we need to compute if we’re adding complexity via this approach

Note that logistic regression is just linear regression with a twist!

Methods

Linear Regression

We can add complexity to the linear model a number of different ways:

  • Global polynomials (curse of dimensionality)

  • Regularization terms like Ridge and LASSO (creates simpler linear models)

  • Splines (don’t generalize to high dimensional problems)

None of these approaches lets us deal with truly complex structures!
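As a reminder of how one of these fixes works: the ridge estimator has the closed form (\mathbf X^T \mathbf X + \lambda \mathbf I)^{-1} \mathbf X^T \mathbf y. The sketch below (synthetic data) just shows the coefficient shrinkage it produces relative to OLS:

```python
import numpy as np

# Synthetic data where only two of five features matter
rng = np.random.default_rng(4)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=50)

lam = 10.0
P = X.shape[1]

# Ridge closed form: (X^T X + lambda I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ y)

# Ordinary least squares for comparison
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
```

The ridge solution has a strictly smaller norm - a simpler linear model, exactly as the bullet says, but still a linear model.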

Methods

Tree-based Methods

\hat{y} = \text{Avg}(y_i \mid \mathbf x_i \in R)

require no specific functional form, but are limited to scalar y values

  • Sufficiently flexible in functional form

  • Unable to handle complex inputs and outputs
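A one-split caricature of a regression tree on synthetic data - the prediction is simply the average outcome within whichever region the input falls into:

```python
import numpy as np

# Synthetic step-function data: y jumps from ~1 to ~3 at x = 5
rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=200)
y = np.where(x < 5, 1.0, 3.0) + rng.normal(scale=0.1, size=200)

# A one-split "tree": prediction is the average of y within each region R
split = 5.0
left_mean = y[x < split].mean()
right_mean = y[x >= split].mean()

def predict(x0):
    return left_mean if x0 < split else right_mean
```

No functional form is assumed, but the output is always a scalar region average - hence the limitation on complex outputs.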

Methods

Kernel-based Approaches

\hat{y} = K(\mathbf x, \mathbf X)

can uncover any possible functional form with an appropriate kernel (the RBF kernel, for example)

However, computation for kernel based methods can be extremely costly when N is large

  • Need to compute an N \times N kernel matrix

Methods

Unfortunately, our toolkit is largely unable to deal with truly complex problems!

But, we can get around that by introducing deep learning into the toolkit

  • A method that allows us to make regression quite flexible

  • And scalable to large data sets

Methods

Based largely around the concept of neural networks:

\hat{y} = \phi(\theta_1\phi(\theta_2(...)))

Deep learning presents a scalable way to address really complicated problems!

Additionally, encoder/decoder architectures allow us to start thinking about meaningful ways to learn about f(\mathbf x), as well.
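The nested-composition form above can be written out directly. The tiny forward pass below uses tanh for \phi and random weights, purely for illustration (real networks are trained, not just evaluated):

```python
import numpy as np

rng = np.random.default_rng(6)

# A two-layer network as nested function composition:
# y_hat = phi(theta_1 @ phi(theta_2 @ x)), with phi = tanh here
theta2 = rng.normal(size=(4, 3))   # first layer: 3 inputs -> 4 hidden units
theta1 = rng.normal(size=(1, 4))   # second layer: 4 hidden -> 1 output

def phi(z):
    return np.tanh(z)

def forward(x):
    return phi(theta1 @ phi(theta2 @ x))

x = rng.normal(size=3)
y_hat = forward(x)
```

Stacking more such compositions is what makes the model “deep” - and what makes the learned function far more flexible than any single linear map.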

Before We Get There

Before we explore neural networks, though:

  • Spend some time reviewing machine learning and predictive modeling basics

  • Review regression methods and regularization

  • Discuss convex optimization methods

Taken together, we’ll have a good underpinning for really understanding and appreciating how deep learning works!

Before We Get There

My thought:

It takes a week or two to learn how to use PyTorch - you could do that on your own

It takes a lot longer to really understand why neural networks and deep learning work the way that they do

  • A lot of math and statistics

  • Really appreciate the modern marvel of LLMs and how quickly they can provide coherent answers!

Prerequisite Topics

  1. Generalization and Overfitting
  2. A unified chain view of regression
  3. First order optimization methods

Deep Learning

  1. Deep Learning for Tabular Data
  2. Deep Learning for Image inputs (and outputs)
  3. Deep learning for Sequences/Text

Generative Methods

  1. Intro to Bayesian Machine Learning
  2. Autoencoders
  3. Generative Adversarial Networks
  4. Normalizing Flows
  5. Diffusion Models

Next Time

A review of generalization

  • Underfitting and overfitting

  • Computing generalization error